sparse training
A Single-Step, Sharpness-Aware Minimization is All You Need to Achieve Efficient and Accurate Sparse Training
Training a sparse DNN struggles to achieve optimal generalization ability despite the efforts of state-of-the-art sparse training methodologies. To unravel the reason behind this difficulty, we connect network sparsity with the structure of neural loss functions and identify that the cause lies in a chaotic loss surface.
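The title pairs sparse training with sharpness-aware minimization (SAM), which counters a chaotic loss surface by descending the gradient taken at a nearby adversarially perturbed point rather than at the current weights. Below is a minimal PyTorch sketch of how the standard two-pass SAM update can be applied under a fixed sparsity mask; the `sam_step` helper and its `masks` dictionary are illustrative assumptions, and this is generic SAM, not the paper's single-step approximation.

```python
import torch

def sam_step(model, masks, loss_fn, x, y, optimizer, rho=0.05):
    """One standard (two-pass) SAM update on a statically masked sparse model.

    masks: dict mapping parameter name -> binary {0,1} tensor of the same
    shape. Assumes every parameter receives a gradient. Illustrative sketch,
    not the paper's single-step variant.
    """
    # First pass: gradient at the current sparse weights.
    loss = loss_fn(model(x), y)
    loss.backward()

    named = list(model.named_parameters())
    with torch.no_grad():
        grads = {n: p.grad * masks[n] for n, p in named}
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads.values()]))
        eps = {}
        for n, p in named:
            e = rho * grads[n] / (grad_norm + 1e-12)
            p.add_(e)   # climb to the locally sharpest nearby point
            eps[n] = e
    model.zero_grad()

    # Second pass: the sharpness-aware gradient, taken at the perturbed point.
    loss_fn(model(x), y).backward()

    with torch.no_grad():
        for n, p in named:
            p.sub_(eps[n])         # restore the original weights
            p.grad.mul_(masks[n])  # keep the update inside the sparsity mask
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Note the extra forward/backward pass per update; removing that overhead is precisely what a single-step variant is after.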
Top-KAST: Top-K Always Sparse Training
Sparse neural networks are becoming increasingly important as the field seeks to improve the performance of existing models by scaling them up, while simultaneously trying to reduce power consumption and computational footprint. Unfortunately, most existing methods for inducing performant sparse models still entail the instantiation of dense parameters, or dense gradients in the backward pass, during training. For very large models this requirement can be prohibitive. In this work we propose Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward passes). We demonstrate the efficacy of our approach by showing that it performs comparably to or better than previous works when training models on the established ImageNet benchmark, whilst fully maintaining sparsity. In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling, where the current best-performing architectures tend to have tens of billions of parameters and scaling up does not yet seem to have saturated performance. Sparse versions of these architectures can be run with significantly fewer resources, making them more widely accessible and applicable. Beyond being effective, our approach is straightforward and can easily be implemented in a wide range of existing machine learning frameworks with only a few additional lines of code. We therefore hope that our contribution will help enable the broader community to explore the potential held by massive models, without incurring massive computational cost.
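To make the always-sparse forward and backward passes concrete, here is a minimal PyTorch sketch of a Top-KAST-style linear layer, assuming magnitude-based top-K selection and a straight-through trick so gradients reach a slightly larger backward set. Class and parameter names are illustrative, the masks are recomputed every step here rather than periodically, and the paper's exploration regularization is omitted.

```python
import torch
import torch.nn.functional as F

def topk_mask(w, density):
    """Binary mask keeping the top `density` fraction of entries of |w|."""
    k = max(1, int(density * w.numel()))
    thresh = w.abs().flatten().kthvalue(w.numel() - k + 1).values
    return (w.abs() >= thresh).float()

class TopKASTLinear(torch.nn.Module):
    """Linear layer sketching the Top-KAST idea: the forward pass uses only
    the top-`fwd_density` weights by magnitude (set A), while gradients flow
    to a slightly larger top-`bwd_density` set (B), so both passes stay
    sparse. Illustrative sketch, not the authors' implementation."""

    def __init__(self, in_features, out_features,
                 fwd_density=0.1, bwd_density=0.2):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features) * 0.01)
        self.fwd_density = fwd_density
        self.bwd_density = bwd_density

    def forward(self, x):
        a = topk_mask(self.weight.detach(), self.fwd_density)  # forward set A
        b = topk_mask(self.weight.detach(), self.bwd_density)  # backward set B
        # Straight-through trick: the forward value equals weight * A, but
        # the gradient flows through weight * B, i.e. to every entry of B.
        w_eff = self.weight * b - (self.weight * (b - a)).detach()
        return F.linear(x, w_eff)
```

Because the backward set B contains the forward set A, weights just outside A keep receiving gradient and can grow back into the forward set, which is what lets the forward pass remain sparse for the whole of training.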